Abstract.

Machine translation systems are not reliable enough to be used “as is”: except
for the simplest tasks, they can only be used to grasp the general meaning
of a text or to assist human translators. The purpose of confidence measures is
to detect erroneous words or sentences produced by a machine translation system.
In this article, after reviewing the mathematical foundations of confidence
estimation, we compare several state-of-the-art confidence measures, predictive
parameters and classifiers. We also propose two original confidence measures based
on Mutual Information and a method for automatically generating data for training
and testing classifiers. We applied these techniques to data from the WMT 2008
campaign and found that the best confidence measures yielded an Equal Error Rate
of 36.3% at word level and 34.2% at sentence level, and that combining different
measures reduced these rates to 35.0% and 29.0% respectively. We also present the
results of an experiment aimed at determining how helpful confidence measures are
in a post-editing task. Preliminary results suggest that our system is not yet
ready to help post-editors efficiently, but we now have software and a protocol
that we can apply to further experiments, and user feedback has indicated aspects
which must be improved in order to make confidence measures more helpful.

Keywords: machine translation, confidence measure, translation evaluation,
support vector machine, mutual information, partial least squares regression,
logistic regression, neural network


1. Introduction 

A machine translation system generates the best translation for a given
sentence according to a previously learnt or hard-coded model. However,
no model exists that is able to capture all the subtlety of natural
language. Therefore even the best machine translation systems make
mistakes, and always will; even experts make mistakes after all. Errors
take a variety of forms: a word can be wrong, misplaced or missing.
Whole translations can be utterly nonsensical or just slightly flawed
(a missing negation, a grammatical error and so forth). Therefore,
when a document is intended for publication, machine translation output
cannot be used “as is”; at best, it can be used to help a human
translator. A tool for detecting and pinpointing translation errors may
ease their work as suggested, for example, in (Ueffing and Ney, 2005).
(Gandrabur and Foster, 2003) suggest the use of confidence estimation
in the context of translation prediction. Confidence estimates could
benefit automatic post-editing systems like the one proposed in (Simard
et al., 2007), by selecting which sentences are to be post-edited. Even
end-users using machine translation for grasping the overall meaning of
a text may appreciate dubious words and sentences being highlighted,
preventing them from placing too much trust in potentially wrong
translations.

However, and maybe because of such high expectations, confidence
estimation is a very difficult problem: if decisions are to be made
based on these estimates (like modifying a translation hypothesis),
the estimates need to be very accurate in order to maintain translation
quality and avoid wasting the user’s time. Confidence estimation remains an
active research field in numerous domains and much work remains to
be done before confidence measures can be integrated into working systems.

This article is an overview of many of today’s available predictive
parameters for machine translation confidence estimation, along with
a few original predictive parameters of our own; we also evaluated
different machine learning techniques (support vector machines, logistic
regression, partial least squares regression and neural networks; see
Section 2) to combine and optimise them. An exhaustive review
would require a whole book; this paper therefore gives a more
targeted overview of some of the most significant ideas in the domain.
(Blatz et al., 2004) proposed a review of many confidence measures for
machine translation. We used this work as a starting point, then carried
out a thorough formalisation of the confidence estimation problem and
made two contributions to the field:

— Original estimators based on Mutual Information and Part-Of- 
Speech tags (Section 6). 

— An algorithm to automatically generate annotated training data 
for correct/incorrect classifiers (Section 4.3). 

We implemented techniques described in (Siu and Gish, 1999) for
the evaluation of the performance of the proposed confidence measures.
In Sections 6.2 and 7.2 we show that using a combination of all predictive
parameters yields an improvement of 1.3 points absolute in terms of
equal error rate (Section 3.1) over the best parameter used alone.

Finally, we present the results of a post-editing experiment in which
we asked volunteers to correct sentences which had been automatically
translated, and measured their efficiency with and without confidence
measures (Section 8). Unfortunately the results suggested that this
confidence estimation system was not yet ready to be included in
post-editing software. However, we provide a number of useful
observations and indications about what is wrong with our system and
what is really important for a user.


1.1. Sentence-level confidence estimation 

We intuitively recognise a wrong translation: it does not have the
same meaning as the source sentence, or has no meaning at all, or is too
disfluent. As state-of-the-art natural language processing software is
still unable to grasp the meaning of a sentence or to assess its
grammatical correctness or fluency, we have to rely on lower level estimators.
The problem is also ill-posed: sometimes one cannot decide what the
meaning of a sentence is (especially without a context), let alone decide if
its translation is correct or not (a translation can be correct for one of
the possible meanings of the source sentence and wrong for another).
In our experiments we asked human judges to assign a numerical score
to machine translated sentences, ranging from one (hopelessly bad
translation) to five (perfect) as described in Section 4.1. We set the
confidence estimation system to automatically detect sentences with
scores of three or higher (disfluencies are considered acceptable, insofar
as a reader could understand the correct meaning in a reasonable time).
To this end we computed simple numeric features (also called predictive
parameters: Language Model (LM) score, length, etc.; see Section 7) and
combined them (Section 2).


1.2. Word-level confidence estimation 

Defining the correctness of a word is even trickier. Sometimes a
translated word may not fit in the context of the source sentence, as
may be the case when homonyms are involved (for example if the French
word “vol”, speaking of a plane, is translated with the English word “theft”
instead of “flight”). In this case the error is obvious, but sometimes the
correctness of a word depends on how other words around it are
translated. Consider the following example:


    Ces mots sont presque synonymes

    These words are almost synonyms   (correct)
    These words have close meanings   (correct)
    These words have close synonyms   (incorrect)

The last translation is definitely incorrect, but then we have to decide
which word is wrong: is it “close”? Is it “synonyms”? Or “have”? All of them?
In the rest of the article we show how we trained classifiers to discriminate
between correct and incorrect words, but this example shows that no
system can ever achieve perfect classification, simply because a perfect
classification does not exist.

1.3. Mathematical formulation 

Let us now state the problem in mathematically sound terms: the goal 
of machine translation is to generate a target sentence from a source 
sentence. A sentence is a finite sequence of words and punctuation 
marks, which are elements of a set called vocabulary. The sentences are 
represented by random variables. We use the following conventions: a 
random variable will be represented by a capital letter and a realisation 
of the variable by the corresponding lower-case letter; bold letters are 
non-scalar values (sentences, vectors, matrices); non-bold letters are for 
scalar values like words and real numbers; cursive letters are sets. 

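$\mathcal{V}_S$ : Source language vocabulary

$\mathcal{V}_T$ : Target language vocabulary

$\mathbf{S} \in \mathcal{V}_S^*$ : Sentence in the source language

$\mathbf{T} \in \mathcal{V}_T^*$ : Sentence in the target language

From these two primary random variables we then derive new variables:

$Len(\mathbf{S}) \in \mathbb{N}$ : Length of $\mathbf{S}$ (number of words)

$Len(\mathbf{T}) \in \mathbb{N}$ : Length of $\mathbf{T}$

$S_i \in \mathcal{V}_S$ : i-th word of $\mathbf{S}$

$T_j \in \mathcal{V}_T$ : j-th word of $\mathbf{T}$

When estimating confidence we are given realisations of these variables
and then need to guess the values of:

$C_{\mathbf{S},\mathbf{T}} \in \{0,1\}$ : correctness of a sentence $\mathbf{T}$ as a translation of $\mathbf{S}$

$C_{\mathbf{S},\mathbf{T},j} \in \{0,1\}$ : correctness of the j-th word of $\mathbf{T}$

To this end the following probability distribution functions (PDFs) are
required and need to be estimated:

$P(C_{\mathbf{S},\mathbf{T}} = 1 | \mathbf{S}, \mathbf{T})$ : the probability that $\mathbf{T}$ is a correct translation of $\mathbf{S}$   (1)

$P(C_{\mathbf{S},\mathbf{T},j} = 1 | \mathbf{S}, \mathbf{T})$ : the probability of correctness of the j-th word of $\mathbf{T}$ given that $\mathbf{T}$ is a translation of $\mathbf{S}$   (2)

As $\mathbf{S}$ and $\mathbf{T}$ may be any sentence, directly estimating these probabilities
is impossible. We therefore opted to map the pair $(\mathbf{S},\mathbf{T})$ to a vector
of $d_s$ numerical features (so-called predictive parameters described in
Section 7.1) via the function $\mathbf{x}_s$. Similarly $(\mathbf{S},\mathbf{T},j)$ was mapped to a
numerical vector of $d_w$ features via $\mathbf{x}_w$ (Section 6.1):

$$\mathbf{x}_s : (\mathbf{S},\mathbf{T}) \in \mathcal{V}_S^* \times \mathcal{V}_T^* \mapsto \mathbf{x}_s(\mathbf{S},\mathbf{T}) \in \mathbb{R}^{d_s}$$

and:

$$\mathbf{x}_w : (\mathbf{S},\mathbf{T},j) \in \mathcal{V}_S^* \times \mathcal{V}_T^* \times \mathbb{N} \mapsto \mathbf{x}_w(\mathbf{S},\mathbf{T},j) \in \mathbb{R}^{d_w}$$

Such parameters may include, for example, the length of source and
target sentences, the score given by a translation model or a language
model, etc. The following PDFs are therefore learnt (the left-hand parts
are just notations) instead of Formulae (1) and (2):

$$p(C_{\mathbf{S},\mathbf{T}}; \mathbf{S}, \mathbf{T}) \stackrel{\text{def}}{=} P(C_{\mathbf{S},\mathbf{T}} | \mathbf{x}_s(\mathbf{S},\mathbf{T})) \qquad (3)$$

$$p(C_{\mathbf{S},\mathbf{T},j}; \mathbf{S}, \mathbf{T}, j) \stackrel{\text{def}}{=} P(C_{\mathbf{S},\mathbf{T},j} | \mathbf{x}_w(\mathbf{S},\mathbf{T},j)) \qquad (4)$$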


Note that although it does not explicitly appear in the notation, $p$
depends on the function $\mathbf{x}$, which will be different in different
experiments, and will also not be the same at sentence and word levels.
These distributions were to be learnt on large data sets (described in 
Section 4) by standard machine learning algorithms (Section 2) such as 
Support Vector Machines (Cortes and Vapnik, 1995), Neural Networks 
(Fausett, 1994), Logistic Regression (Menard, 2002) or Partial Least 
Squares Regression (Tobias, 1995). 

1.3.1. Classification 

After this training process the probability estimates (Formulae (3) and
(4)) could be used as confidence measures. It was then possible to
compute a classification:

$$c : (\mathbf{T},\mathbf{S}) \mapsto c(\mathbf{T},\mathbf{S}) \in \{0,1\}$$

or at word-level:

$$c : (\mathbf{T},\mathbf{S},j) \mapsto c(\mathbf{T},\mathbf{S},j) \in \{0,1\}$$

In order to minimise the number of errors, classification needs to be
performed according to:

$$c(\mathbf{T},\mathbf{S}) \stackrel{\text{def}}{=} \arg\max_{c \in \{0,1\}} p(c; \mathbf{S}, \mathbf{T}) \qquad (5)$$

$$c(\mathbf{T},\mathbf{S},j) \stackrel{\text{def}}{=} \arg\max_{c \in \{0,1\}} p(c; \mathbf{S}, \mathbf{T}, j) \qquad (6)$$

However this is too strict: it neither accounts for biased probability
estimates nor permits the attribution of levels of importance
to correct rejection or correct acceptance, that is, correct detection
of erroneous translations versus correct detection of good translations
(see performance estimation in Section 3). Therefore we introduced an
acceptance threshold $\delta$:

$$c(\mathbf{T},\mathbf{S};\delta) \stackrel{\text{def}}{=} \begin{cases} 1 & \text{if } p(1;\mathbf{S},\mathbf{T}) > \delta \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

$$c(\mathbf{T},\mathbf{S},j;\delta) \stackrel{\text{def}}{=} \begin{cases} 1 & \text{if } p(1;\mathbf{S},\mathbf{T},j) > \delta \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

If $\delta = 0.5$ Formulae (7) and (8) are equivalent to (5) and (6). But
setting a higher $\delta$ may compensate for a positive bias in probability
estimates (3) and (4) or penalise false acceptances more heavily, while
setting a lower $\delta$ compensates for a negative bias or penalises false
rejections more heavily.
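
The effect of the acceptance threshold in Formulae (7) and (8) can be
illustrated with a minimal Python sketch. The function name `classify` and
the probability values are hypothetical and serve only as an example; they
are not taken from the experiments reported here.

```python
def classify(p_correct: float, delta: float = 0.5) -> int:
    """Formulae (7)/(8): accept (1) when the estimated probability of
    correctness is strictly greater than the threshold delta."""
    return 1 if p_correct > delta else 0


# Hypothetical probability estimates for four items.
probs = [0.2, 0.45, 0.7, 0.95]

relaxed = [classify(p, delta=0.3) for p in probs]  # low threshold
default = [classify(p) for p in probs]             # delta = 0.5
strict = [classify(p, delta=0.8) for p in probs]   # high threshold

print(relaxed, default, strict)
```

Raising the threshold can only turn acceptances into rejections, which is
why a higher threshold trades false acceptances for false rejections.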


1.3.2. Bias 

Probability estimates of Formulae (3) and (4) are often biased. This 
generally does not harm classification performance for two reasons. 
Firstly, when the bias is uniform (p* = p + b where b is constant), 
removing the bias is equivalent to setting an appropriate acceptance 
threshold. Secondly and most importantly, these PDFs are learnt by 
minimising classification cost. It is therefore not surprising that even 
if the probabilities are biased and even if the bias is not uniform 
(p* = p+b(p)), positive examples generally obtain a higher probability 
than negative ones. 

However, biased probability estimates can harm other performance
metrics and in particular will definitely harm Normalised Mutual
Information (Section 3) as shown in (Siu and Gish, 1999). We therefore
estimated bias on a separate corpus as explained in Siu’s paper. The
interval [0,1] was split into 1000 non-overlapping bins $B_i$ of uniform
width and bias was estimated separately on each of them:

$$\forall i \in \{1,..,1000\} . \; b(B_i) = \frac{\sum_{\{j | p_j \in B_i\}} (p_j - c_j^*)}{|\{j | p_j \in B_i\}|} \qquad (9)$$

where $p_j$ are the estimated probabilities of correctness of items in
the training set dedicated to bias estimation, and $c_j^*$ their true classes.
Then we obtained an unbiasing function:

$$\text{if } p \in B_i : unbias(p) = p - b(B_i) \qquad (10)$$

If $p$ is the probability of correctness estimated by a confidence
measure, we chose to use the unbiased estimation in our applications:

$$p(1; \mathbf{S}, \mathbf{T}) = unbias(p)$$
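
The binning procedure can be sketched as follows in Python. This is only an
illustration under stated assumptions: the bias of a bin is taken to be the
mean difference between estimated probabilities and true classes, empty bins
are given a bias of zero, and the names `estimate_bias` and `unbias` as well
as the toy data are hypothetical.

```python
def estimate_bias(probs, classes, n_bins=1000):
    """Per-bin bias: mean of (estimated probability - true class).

    Assumption: bins that receive no item get a bias of 0.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, c in zip(probs, classes):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[i] += p - c
        counts[i] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]


def unbias(p, bias):
    """Subtract the bias of the bin that p falls into."""
    n_bins = len(bias)
    i = min(int(p * n_bins), n_bins - 1)
    return p - bias[i]


# Toy data: items scored 0.8 are correct only half of the time.
toy_probs = [0.8, 0.8, 0.8, 0.8]
toy_classes = [1, 1, 0, 0]
toy_bias = estimate_bias(toy_probs, toy_classes, n_bins=2)
corrected = unbias(0.8, toy_bias)  # close to 0.5
```

In this toy case the estimate 0.8 overshoots the observed rate of
correctness (0.5) by 0.3, and the correction removes that offset.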


1.3.3. Sentence quality assessment 

Some applications do not require the classification of sentences as
correct or incorrect, but rather the estimation of overall quality of the
translation. This would resemble BLEU score (Papineni et al., 2001)
or Translation Edit Rate (Snover et al., 2006), only without using
reference translations. In this case a quality metric is more suitable than
a correctness probability. In Section 7 we therefore present a method for
learning the PDF of Formula (3) which can also perform regression against
quality scores. The training set for this task was:

$$\{(\mathbf{x}_s(\mathbf{s}^n,\mathbf{t}^n), q^*_{\mathbf{s}^n,\mathbf{t}^n})\}_{n=1..N} \subset \mathbb{R}^{d_s} \times \mathbb{R}^+$$

where $q^*_{\mathbf{s}^n,\mathbf{t}^n}$ is a score relying on expert knowledge. This can be a human
evaluation, or a metric computed by comparing the sentence to expert-given
references, like Word Error Rate, BLEU or Translation Edit Rate.
The goal is to learn the mapping $f_\Theta : \mathbb{R}^{d_s} \to \mathbb{R}^+$ minimising the mean
quadratic error using regression techniques (linear regression, support
vector regression, partial least squares regression...) where $\Theta$ is a set of
parameters to be estimated by regression:

$$\frac{1}{N} \sum_{n=1}^{N} \left| f_\Theta(\mathbf{x}_s(\mathbf{s}^n,\mathbf{t}^n)) - q^*_{\mathbf{s}^n,\mathbf{t}^n} \right|^2 \qquad (11)$$


1.3.4. Training sets 

PDFs (Equations (3) and (4)) and regression parameters $\Theta$ in Equation
(11) need to be learnt using large data sets. Such data sets consist of:

— $N$ source sentences $\mathbf{s}^1, .., \mathbf{s}^N$ which are realisations of $\mathbf{S}$.

— The corresponding $N$ automatically translated sentences $\mathbf{t}^1, .., \mathbf{t}^N$
which are realisations of $\mathbf{T}$.

— Reference sentence classes and a quality metric
$((c^*_{\mathbf{s}^1,\mathbf{t}^1}, q^*_{\mathbf{s}^1,\mathbf{t}^1}), .., (c^*_{\mathbf{s}^N,\mathbf{t}^N}, q^*_{\mathbf{s}^N,\mathbf{t}^N})) \in (\{0,1\} \times \mathbb{R}^+)^N$;
the classes are realisations of $C_{\mathbf{S},\mathbf{T}}$; they can be given by human
experts (Section 4.1) or automatically generated from human translations
(Sections 4.2 and 4.3).

— Reference word classes
$\forall n \in \{1,..,N\} . \; (c^*_{\mathbf{s}^n,\mathbf{t}^n,1}, .., c^*_{\mathbf{s}^n,\mathbf{t}^n,Len(\mathbf{t}^n)}) \in \{0,1\}^{Len(\mathbf{t}^n)}$ which are
realisations of $C_{\mathbf{S},\mathbf{T},j}$, also given by human experts.


2. Classification and regression techniques 

The problem of confidence estimation is now reduced to standard
classification and regression problems. Many machine learning
techniques are available; we opted to experiment with the well-known
Support Vector Machines, Logistic Regression and Artificial
Neural Networks, and also with the lesser-known Partial Least Squares
Regression.

2.1. Logistic Regression 

Here we wanted to predict the correctness $C \in \{0,1\}$ given a set of
features $\mathbf{X} \in \mathbb{R}^d$; to this end we needed to estimate the distribution
$P(C = 1|\mathbf{X})$. Logistic Regression (Menard, 2002) consists of assuming
that:

$$P(C = 1|\mathbf{X}) = \frac{1}{1 + e^{-(\langle \Theta, \mathbf{X} \rangle + b)}} \qquad (12)$$

for some $\Theta \in \mathbb{R}^d$ and $b \in \mathbb{R}$, and then optimising $\Theta$ and $b$ with regard to
the maximum likelihood criterion on the training data. Logistic regression
was used not only to combine several features but also to map the scores
produced by a confidence estimator to a probability distribution.
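
As an illustration of the maximum likelihood fit, here is a minimal
pure-Python sketch using batch gradient ascent. The optimiser, learning
rate, number of epochs, function names and toy data are all assumptions
made for the example; the sign convention follows the standard logistic
function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, b, x):
    """Estimated P(C = 1 | x) under the logistic model."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)) + b)


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Maximise the log-likelihood by batch gradient ascent."""
    d, n = len(X[0]), len(X)
    theta, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_theta, grad_b = [0.0] * d, 0.0
        for x, c in zip(X, y):
            err = c - predict(theta, b, x)  # gradient of the log-likelihood
            for k in range(d):
                grad_theta[k] += err * x[k]
            grad_b += err
        theta = [t + lr * g / n for t, g in zip(theta, grad_theta)]
        b += lr * grad_b / n
    return theta, b


# Toy one-dimensional scores: higher scores are more often correct.
X_toy = [[0.1], [0.3], [0.4], [0.6], [0.8], [0.9]]
y_toy = [0, 0, 1, 0, 1, 1]
theta_toy, b_toy = train_logistic(X_toy, y_toy)
p_low = predict(theta_toy, b_toy, [0.1])
p_high = predict(theta_toy, b_toy, [0.9])
```

The same routine maps a single raw confidence score to a probability when
each feature vector has one component, as in the toy data.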

2.2. Support Vector Machine 

The well-known Support Vector Machines (SVMs) (Hsu et al., 2003)
have highly desirable characteristics which made them well suited to
our problem. They are able to discriminate between two non-linearly
separable classes; they can compute the probability that a given
sample belongs to one class, and not only a binary decision; and they
can also be used to perform regression against numerical scores (Smola
and Schölkopf, 2004). We used LibSVM (Chang and Lin, 2011) for
feature scaling, classification and regression.

2.2.1. SVM for classification

SVMs were trained to produce a probability of correctness. By doing so
the acceptance threshold could be adapted (Section 1.3 and Equations
(7) and (8)), making the classifier more flexible. The kernel used was
a Radial Basis Function since it is simple and was reported in (Zhang
and Rudnicky, 2001) as giving good results:

$$K_\gamma(\mathbf{x}(\mathbf{s},\mathbf{t},j), \mathbf{x}(\mathbf{s}',\mathbf{t}',j')) = e^{-\gamma \| \mathbf{x}(\mathbf{s},\mathbf{t},j) - \mathbf{x}(\mathbf{s}',\mathbf{t}',j') \|^2}$$
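
For reference, the radial basis function can be computed as in the
following Python sketch. It assumes the usual form with a squared Euclidean
distance (the one used by LibSVM); the function name `rbf_kernel` and the
feature vectors are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)  # identical vectors
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # squared distance 2
```

The kernel equals 1 for identical feature vectors and decays towards 0 as
they move apart, at a speed controlled by the gamma parameter.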


2.2.2. SVM for quality evaluation 

The same kernel was used but this time to perform regression against 
sentence-level BLEU score (Papineni et al., 2001). 

2.2.3. Meta-parameters optimisation 

SVMs require two meta-parameters to be optimised: the $\gamma$ parameter of
the radial basis function, and the error cost $C$. $\gamma$ and $C$ were optimised
by grid search on the development corpus with regard to equal error
rate for classification, and mean quadratic error for regression.
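
A grid search of this kind can be sketched as below. The grids, the
function name `grid_search` and the `toy_cost` function are hypothetical
placeholders: in a real setting the cost function would train an SVM with
the given parameters and return the equal error rate (or the mean quadratic
error) measured on the development corpus.

```python
import itertools
import math


def grid_search(evaluate, gammas, costs):
    """Return (best_score, gamma, cost) minimising evaluate(gamma, cost)."""
    best = None
    for gamma, cost in itertools.product(gammas, costs):
        score = evaluate(gamma, cost)
        if best is None or score < best[0]:
            best = (score, gamma, cost)
    return best


# Hypothetical exponentially spaced grids.
gammas = [2.0 ** k for k in range(-5, 2, 2)]  # 2^-5, 2^-3, 2^-1, 2^1
costs = [2.0 ** k for k in range(-1, 6, 2)]   # 2^-1, 2^1, 2^3, 2^5


def toy_cost(gamma, cost):
    """Stand-in for a development-set error; minimal at gamma=2^-3, C=2^1."""
    return abs(math.log2(gamma) + 3) + abs(math.log2(cost) - 1)


best_score, best_gamma, best_cost = grid_search(toy_cost, gammas, costs)
```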

2.3. Neural Network 

The FANN toolkit (Fast Artificial Neural Network (Nissen, 2003)) was
used for building feed-forward neural networks (NN). After experimenting
on a development set we decided to stick to the standard structure,
namely one input layer with as many neurons as we have features, one
hidden layer with half as many neurons, and an output layer made of a
single neuron which returns a probability of correctness. The connection
rate was 0.5 in order to keep computation time tractable. We stuck to
the default sigmoid activation function. The weights were optimised by
standard gradient back-propagation.

2.4. Partial Least Squares Regression 

Partial Least Squares Regression (Wold et al., 1984; Specia et al.,
2009) is a multivariate data analysis technique that finds a bilinear
relation between the observable variables (our features $\mathbf{X}$) and the
response variables, namely the probability of correctness $p(1; \mathbf{X})$ or the
quality score. It works by projecting both predictors and observations
on a linear subspace and performing least-squares regression in this space.
It has the major advantage of being robust to correlated predictors.


3. Evaluation of the classifiers

Error rate is the most obvious metric for measuring the performance
of a classifier. It is however not an appropriate metric because of its
sensitivity to class priors (Kononenko and Bratko, 1991; Siu and Gish,
1999). To illustrate the problem, consider a machine translation system
which gives roughly 15% of wrongly translated words. Now let us consider
a confidence measure such that:

$$\forall \mathbf{s},\mathbf{t},j \quad p^0(1; \mathbf{s},\mathbf{t},j) = 1$$

It makes no error on correct words (85% of total) but misclassifies all
wrong words (15%). Its error rate is therefore $0 \times 0.85 + 1 \times 0.15 = 0.15$.
Now let us consider a second confidence measure $p^1(1;\mathbf{s},\mathbf{t},j)$ which
correctly detects every wrong word (if the j-th word of $\mathbf{t}$ is wrong
then $p^1(1;\mathbf{s},\mathbf{t},j) = 0$) but also incorrectly assigns a null probability
of correctness to 20% of the words that are appropriate translations.
The error rate of this measure is: $0 \times 0.15 + 0.20 \times 0.85 = 0.17$.

$p^0$ therefore seems to outperform $p^1$. This is however not true,
because $p^0$ does not provide the user with any useful information (or
actually any information at all, strictly speaking), while if
$p^1(1;\mathbf{s},\mathbf{t},j) > 0$ then we would be certain that the word is correct.
There is a lesson here: an appropriate metric for the usefulness of a
confidence measure is not the number of misclassifications it makes but
the amount of information it provides. This is why we opted to use
Normalised Mutual Information (Siu and Gish, 1999) to assess the
performance of a measure (Section 3.2), along with Equal Error Rate (EER)
and Discrimination Error Trade-off (DET) curves (Section 3.1). The latter
is a powerful tool for the visualisation of the behaviour of a classifier
with different acceptance thresholds and therefore different compromises
between incorrect acceptances and incorrect rejections.
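
The comparison above can be reproduced on a toy sample of 100 words, 85
correct and 15 wrong, as in the following Python sketch. The function name
`rates` and the way the sample is laid out are assumptions made for the
example.

```python
def rates(predictions, truth):
    """Return (error rate, false acceptance rate, false rejection rate)."""
    false_acc = sum(1 for c, t in zip(predictions, truth) if c == 1 and t == 0)
    false_rej = sum(1 for c, t in zip(predictions, truth) if c == 0 and t == 1)
    n_wrong = truth.count(0)
    n_correct = truth.count(1)
    return ((false_acc + false_rej) / len(truth),
            false_acc / n_wrong,
            false_rej / n_correct)


truth = [1] * 85 + [0] * 15  # 85 correct words, 15 wrong words

# First measure: every word is accepted.
pred_p0 = [1] * 100

# Second measure: all wrong words are rejected, as well as
# 17 correct words (20% of the 85 correct ones).
pred_p1 = [0] * 17 + [1] * 68 + [0] * 15

rates_p0 = rates(pred_p0, truth)  # error rate 0.15
rates_p1 = rates(pred_p1, truth)  # error rate 0.17
```

The first measure has the lower error rate but accepts every wrong word,
whereas the second never accepts a wrong word and rejects one correct word
in five.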

3.1. Discrimination Error Trade-off 

A classifier makes two kinds of mistakes: false acceptance (also called
Type 1 error), when an erroneous item (word or sentence) is classified as
correct, and false rejection (or Type 2 error), when a correct item
is classified as incorrect. When evaluating the performance of a classifier
we know the predictions $c$ (Equations (7) and (8)) and the actual
realisations $c^*$ of the variables $C$. As stated above in Section 1.3,
$c(\mathbf{t},\mathbf{s};\delta)$ is the estimated correctness of translation $\mathbf{t}$ given source
sentence $\mathbf{s}$ with acceptance threshold $\delta$, and $c^*_{\mathbf{s},\mathbf{t}}$ is the true
(expert-given) correctness (Section 1.3.4). Sentence-level false
acceptance rate is:

$$e_1(\mathbf{s},\mathbf{t};\delta) = \begin{cases} 1 & \text{if } c(\mathbf{t},\mathbf{s};\delta) = 1 \text{ and } c^*_{\mathbf{s},\mathbf{t}} = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

$$err_1(\delta) = \frac{\sum_{\mathbf{s},\mathbf{t}} e_1(\mathbf{s},\mathbf{t};\delta)}{\sum_{\mathbf{s},\mathbf{t}} (1 - c^*_{\mathbf{s},\mathbf{t}})} \qquad (14)$$

$err_1$ is therefore the proportion of wrong items which are incorrectly
accepted ($\sum_{\mathbf{s},\mathbf{t}} (1 - c^*_{\mathbf{s},\mathbf{t}})$ is the number of wrong items).

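Sentence-level false rejection rate is:

$$e_2(\mathbf{s},\mathbf{t};\delta) = \begin{cases} 1 & \text{if } c(\mathbf{t},\mathbf{s};\delta) = 0 \text{ and } c^*_{\mathbf{s},\mathbf{t}} = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

$$err_2(\delta) = \frac{\sum_{\mathbf{s},\mathbf{t}} e_2(\mathbf{s},\mathbf{t};\delta)}{\sum_{\mathbf{s},\mathbf{t}} c^*_{\mathbf{s},\mathbf{t}}} \qquad (16)$$

$err_2$ is the proportion of correct items which are rejected by the
classifier. Adapting these formulae to word-level is straightforward.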


Intuitively $err_1$ is the proportion of erroneous words that the
classifier wrongly accepts, while $err_2$ is the proportion of correct words
that the classifier wrongly rejects. A relaxed classifier has a low $err_2$
and a high $err_1$ while a strict one has a low $err_1$ and a high $err_2$.
Proof that $err_1$ and $err_2$ are insensitive to class priors was given in
(Siu and Gish, 1999).

When $\delta$ goes from 0 to 1, more and more items are rejected. Therefore
the false rejection rate ($err_2$) monotonically increases from 0 to
1 while the false acceptance rate ($err_1$) monotonically decreases from
1 to 0. The plot of $err_1(\delta)$ against $err_2(\delta)$ is called the DET curve
(Discrimination Error Trade-off). See examples in Section 6.

A lower curve indicates a better classifier. All points of the DET
curve should lie below the diagonal [(0,1), (1,0)], which is the
theoretical curve of a classifier using features uncorrelated with
correctness (that is, inappropriate features).

Both $err_1$ and $err_2$ are generally approximations of continuous
functions¹. Therefore a threshold $\delta_{EER}$ exists such that:

$$err_1(\delta_{EER}) = err_2(\delta_{EER}) = EER \qquad (17)$$

EER is called the equal error rate. It can be seen as a “summary” of
the DET curve when the acceptance threshold is set so that there are
the same proportions of Type 1 and Type 2 errors, and can be used for direct
comparisons between classifiers. However this is arbitrary, because the
user may prefer to have fewer errors of one type, at the cost of more of
the other type.

¹ It actually depends on the true and estimated PDFs. When this is not the
case they shall be approximated by continuous functions.
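
A threshold sweep producing the points of a DET curve and an approximate
EER can be sketched in Python as follows. The function names, the grid of
101 thresholds and the toy scores are assumptions; on finite data the two
rates are step functions, so the sketch simply takes the threshold where
they are closest.

```python
def det_points(probs, truth, thresholds):
    """For each threshold, return (delta, err_1, err_2) as in (14) and (16)."""
    n_wrong = truth.count(0)
    n_correct = truth.count(1)
    points = []
    for delta in thresholds:
        false_acc = sum(1 for p, t in zip(probs, truth) if p > delta and t == 0)
        false_rej = sum(1 for p, t in zip(probs, truth) if p <= delta and t == 1)
        points.append((delta, false_acc / n_wrong, false_rej / n_correct))
    return points


def equal_error_rate(points):
    """Pick the threshold where err_1 and err_2 are closest (first on ties)."""
    delta, err_1, err_2 = min(points, key=lambda pt: abs(pt[1] - pt[2]))
    return delta, (err_1 + err_2) / 2


# Toy confidence scores: four wrong items followed by four correct ones.
toy_scores = [0.1, 0.3, 0.35, 0.6, 0.4, 0.65, 0.8, 0.9]
toy_truth = [0, 0, 0, 0, 1, 1, 1, 1]
thresholds = [i / 100 for i in range(101)]

curve = det_points(toy_scores, toy_truth, thresholds)
delta_eer, eer = equal_error_rate(curve)
```

On this toy sample one wrong item (0.6) scores above one correct item
(0.4), so any threshold between them yields one error of each type out of
four.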

3.2. Normalised Mutual Information 

Normalised Mutual Information (NMI) measures the level of
informativeness of a predictive parameter or a set thereof in an
application-independent manner (Siu and Gish, 1999). Intuitively NMI measures
the reduction of entropy of the distribution of true class $C$ over the set
{“correct”, “incorrect”} when the value of the predictive parameter is
known; let $\mathbf{x}(\mathbf{S},\mathbf{T})$ be a vector of predictive parameters:


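$$NMI(C, \mathbf{x}) = \frac{I(C; \mathbf{x})}{H(C)} = \frac{H(C) - H(C|\mathbf{x})}{H(C)} \qquad (18)$$

$$H(C) = -p^* \log(p^*) - (1 - p^*) \log(1 - p^*)$$

$$H(C|\mathbf{x}) = -\int_{\mathbf{v}} P(\mathbf{x}(\mathbf{S},\mathbf{T}) = \mathbf{v}) \times \sum_{c \in \{0,1\}} P(C = c|\mathbf{x}(\mathbf{S},\mathbf{T}) = \mathbf{v}) \log(P(C = c|\mathbf{x}(\mathbf{S},\mathbf{T}) = \mathbf{v})) \, d\mathbf{v}$$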


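where $I$ is mutual information, $H$ is entropy and $p^*$ is the true prior
probability of correctness. In practice the true distribution $P(\mathbf{x}(\mathbf{S},\mathbf{T}))$
is replaced with empirical frequencies observed in data, and
$P(C|\mathbf{x}(\mathbf{S},\mathbf{T}))$ is replaced with the computed estimation, which gives:

— Sentence-level NMI:

$$H(C|\mathbf{x}) \simeq -\frac{1}{|\mathcal{S}|} \sum_{(\mathbf{s},\mathbf{t}) \in \mathcal{S}} \Big( p(1;\mathbf{s},\mathbf{t}) \log(p(1;\mathbf{s},\mathbf{t})) + (1 - p(1;\mathbf{s},\mathbf{t})) \log(1 - p(1;\mathbf{s},\mathbf{t})) \Big) \qquad (19)$$

— Word-level NMI:

$$H(C|\mathbf{x}) \simeq -\frac{1}{N_w} \sum_{(\mathbf{s},\mathbf{t}) \in \mathcal{S}} \sum_{j=1}^{Len(\mathbf{t})} \Big( p(1;\mathbf{s},\mathbf{t},j) \log(p(1;\mathbf{s},\mathbf{t},j)) + (1 - p(1;\mathbf{s},\mathbf{t},j)) \log(1 - p(1;\mathbf{s},\mathbf{t},j)) \Big) \qquad (20)$$

where $\mathcal{S}$ is the set of sentence pairs in the data and
$N_w = \sum_{(\mathbf{s},\mathbf{t}) \in \mathcal{S}} Len(\mathbf{t})$ is its total number of target words.

$H(C|\mathbf{x})$ can never be lower than 0 and equality is achieved when,
for every pair of sentences (or all words within), $p(c^*_{\mathbf{s},\mathbf{t}};\mathbf{s},\mathbf{t}) = 1$, which
means that the true class is predicted with no uncertainty. On the
other hand $H(C|\mathbf{x})$ can never be greater than $H(C)$ and equality is
achieved when the predictive parameters are completely useless.
Therefore $NMI(C, \mathbf{x})$ is theoretically a real number between 0 and 1.
However the approximation of $H(C|\mathbf{x})$ can be negative in practice.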
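
The empirical approximation can be sketched in Python as below. It follows
the plug-in form of Formulae (19) and (20) and assumes that the sum is
averaged over the number of items; the function names and the toy values
are hypothetical, and the prior must lie strictly between 0 and 1.

```python
import math


def binary_entropy(p):
    """Entropy of a Bernoulli(p) variable, with 0 * log(0) taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def nmi(probs, prior_correct):
    """Approximate NMI from estimated probabilities of correctness.

    prior_correct is the prior probability of correctness (strictly
    between 0 and 1, otherwise H(C) is zero and NMI is undefined).
    """
    h_c = binary_entropy(prior_correct)
    h_c_given_x = sum(binary_entropy(p) for p in probs) / len(probs)
    return (h_c - h_c_given_x) / h_c


# A measure that always returns the prior carries no information.
useless = nmi([0.85, 0.85, 0.85, 0.85], prior_correct=0.85)

# A measure that is always certain removes all uncertainty.
certain = nmi([1.0, 0.0, 1.0], prior_correct=0.85)
```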


4. Training and testing data 

Large data sets are needed to learn PDFs of Formulae (3) and (4). 
Ideally a human professional translator would read the output of a 
Machine Translation system and assign a label (correct or incorrect ) 
to each item. This method would give high quality training data but 
would be extremely expensive. Therefore it would be preferable to use 
automatic or semi-automatic methods for efficiently classifying words 
and sentences. In the following we will discuss different methods for 
obtaining labelled data. 

4.1. Expert-Annotated corpora 

This is the high-quality, high-cost method whereby human experts analyse
translations produced by a machine translation system and decide if each
word and sentence is correct or not. The classification depends on the 
application, but in our setting a word is classified as erroneous if it is 
an incorrect translation, if it suffers from a severe agreement error or 
if it is completely misplaced. A sentence is considered wrong if it is 
not clear that it has the same meaning as the sentence of which it is 
supposed to be a translation, or any meaning at all, or if it contains 
a significant level of ambiguity that was not apparent in the source 
sentence. This method has two major drawbacks. The first is that it is 
extremely slow and therefore expensive, and the second is that it is not 
reproducible because a given sentence may be differently classified by 
different translators, or by the same translator at different times. 

We needed a small corpus of real, expert-annotated machine-translated
sentences for our test set. To this end we set up the statistical machine
translation system described as the baseline for the WMT08 evaluation
campaign, following the instructions on the StatMT website²: it
features a 5-gram language model with Kneser-Ney discounting trained
with SRILM (Stolcke, 2002) on about 35 million running words, an IBM-5
translation model trained on around 40 million words, and Moses
(Koehn et al., 2007). A hold-out set of 40,000 sentence pairs was
extracted from data for the purpose of training the confidence estimation
system. We annotated a small set of 150 automatically translated
sentences from transcriptions of news broadcasts. Because of these
sentences’ spontaneous style and a vocabulary which did not match that
of the training corpora (European Parliament) the BLEU score is not
high (21.8 with only one reference). However most translations were
intelligible when given some thought.

² http://statmt.org/wmt08/baseline.html

A word was annotated as “incorrect” if it was completely irrelevant, 
very misplaced or grammatically flawed. Sentences were given scores 
ranging from one (hopelessly flawed) to five (perfect). For classification 
purposes we considered sentences scoring three or higher (possible to 
get the correct meaning when given a little thought) to be correct. 

Here are a few examples of expert-annotated sentences (the incorrect 
words are underlined): 


Source sentence                              Machine translation             Score

je vous remercie monsieur le commissaire     thank you mr commissioner       2
pour votre declaration.                      for your question.

j’ai de nombreuses questions a poser         i have some questions to ask    4
a m. le commissaire.                         to the commissioner.

les objectifs de la strategie de lisbonne    the lisboa strategy mistaken.   3
ne sont pas les bons.


4.2. Automatically annotated corpora 

An intuitive idea is to compare a generated translation to a reference
translation, and classify as correct the candidate words that are
Levenshtein-aligned to a word in the reference translation (Ueffing 
and Ney, 2004). However this is too strict and many correct words 
would be incorrectly classified, because there are often many possible 
translations for a given source sentence and these may have nothing 
in common. This problem can partly be overcome by using multiple 
reference translations (Blatz et al., 2004). However multiple references 
are not always available and are costly to produce. 
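
The single-reference labelling scheme can be sketched in Python as follows:
words of the hypothesis are aligned to the reference by word-level
Levenshtein distance, and a word is labelled correct when it is aligned to
an identical reference word. The function name and the way ties between
equally good alignments are broken are assumptions; the sentences reuse the
example of Section 1.2.

```python
def levenshtein_labels(hyp, ref):
    """Label each hypothesis word 1 if aligned to an identical reference word."""
    n, m = len(hyp), len(ref)
    # dist[i][j] = edit distance between hyp[:i] and ref[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Backtrace one optimal alignment (match/substitution tried first).
    labels = [0] * n
    i, j = n, m
    while i > 0 and j > 0:
        if dist[i][j] == dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            labels[i - 1] = int(hyp[i - 1] == ref[j - 1])
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1  # hypothesis word with no counterpart in the reference
        else:
            j -= 1  # reference word with no counterpart in the hypothesis
    return labels


hypothesis = "these words have close synonyms".split()
reference = "these words are almost synonyms".split()
labels = levenshtein_labels(hypothesis, reference)
```

With another equally valid reference (“these words have close meanings”)
the word “have” would be labelled correct, which illustrates why a single
reference is too strict.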

4.3. Artificial training data 

In this section we present an algorithm aimed at getting the best of
both worlds, namely automatically generating sentences (no humans
involved, quickly generating huge amounts of data as with automatic
annotation), without any annotation error (no errors in gold standard
classes as with human annotation). Our objective was to generate
enough data for training classifiers in order to combine several
predictive parameters.


cm-springer-utf8.tex; 22/06/2011; 16:20; p.14 






Starting from human-made reference translations, errors were automatically
introduced in order to generate examples for training confidence
measures. Given an English sentence t (which is a correct
translation of source sentence s), we first chose where to introduce 
errors. As machine translation errors tend to be “bursty” (not evenly 
distributed but appearing in clusters) we implemented two error models 
whose parameters were estimated on a few human annotated sentences. 
These annotations were not required to be extremely precise. 

Bigram error model: firstly we implemented a simple bigram
model $P(C_i | C_{i-1})$: the probability that a word is correct given the
correctness of the preceding word. The first word in a sentence has
an a priori probability of being correct. According to this model we
generated sequences of ones and zeroes corresponding to correct and
incorrect words. We found that nine sentences out of ten in our human
annotated test set started with a correct word, that a wrong word
had approximately a 50% chance of being followed by another wrong
word ($P(C_i = 0 | C_{i-1} = 0) \simeq 0.5$) and that a correct word had
approximately a 90% chance of being followed by another correct word
($P(C_i = 1 | C_{i-1} = 1) \simeq 0.9$).
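
Sampling from this bigram model can be sketched in Python as follows. The
default parameter values are the estimates quoted above; the function name,
its signature and the optional random generator argument are hypothetical.

```python
import random


def sample_bigram_labels(length, p_first_correct=0.9, p_stay_correct=0.9,
                         p_stay_wrong=0.5, rng=None):
    """Draw a sequence of 1 (correct) / 0 (wrong) labels from the bigram model."""
    rng = rng or random.Random()
    labels = []
    for i in range(length):
        if i == 0:
            p_correct = p_first_correct     # prior for the first word
        elif labels[-1] == 1:
            p_correct = p_stay_correct      # P(C_i = 1 | C_{i-1} = 1)
        else:
            p_correct = 1.0 - p_stay_wrong  # P(C_i = 1 | C_{i-1} = 0)
        labels.append(1 if rng.random() < p_correct else 0)
    return labels


# One reproducible sample for a 20-word sentence.
sample = sample_bigram_labels(20, rng=random.Random(0))
```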

Cluster error model: the second model explicitly represents clusters.
A sentence is a sequence of clusters of correct words and clusters of
incorrect words: $C_1, \dots, C_n$. By definition if a cluster contains
correct words, the next cluster will contain incorrect words and vice
versa. Let $C_i$ be the correctness of words in the i-th cluster.
$P(\mathrm{length}(C_i) \mid C_i = 0)$ and
$P(\mathrm{length}(C_i) \mid C_i = 1)$ were estimated on a hold-out set
of 50 machine translations annotated by a human. Sequences of zeroes
and ones were generated accordingly. This model’s parameters cannot
theoretically be represented by a finite set of real numbers (they are
distributions over N). In practice, cluster lengths are bounded,
therefore these distributions are actually over
$\{0, \dots, \max(\mathrm{length}(C_i))\}$. Just to give an idea, we
found that the average length of a cluster of wrong words was 1.9
($\sum_k k \times P(\mathrm{length}(C_i) = k \mid C_i = 0) = 1.9$), and
that of a cluster of correct words was 12.2.
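
The cluster model can be sketched in the same way. The code below is an
illustrative sketch under stated assumptions, not the original
implementation: the length distributions are passed as dictionaries
(hypothetical representation), cluster lengths are assumed to be at
least 1, and the class of the first cluster is drawn with an a priori
probability borrowed from the bigram model.

```python
import random


def mean_length(dist):
    """Expected cluster length: sum over k of k * P(length = k)."""
    return sum(k * p for k, p in dist.items())


def sample_cluster_labels(length, len_dist_correct, len_dist_wrong,
                          p_first_correct=0.9, rng=None):
    """Generate labels (1 = correct, 0 = incorrect) cluster by cluster.

    len_dist_correct / len_dist_wrong map a cluster length k >= 1 to
    P(length = k | class of the cluster).
    """
    rng = rng or random.Random()
    state = 1 if rng.random() < p_first_correct else 0
    labels = []
    while len(labels) < length:
        dist = len_dist_correct if state == 1 else len_dist_wrong
        lengths = list(dist)
        k = rng.choices(lengths, weights=[dist[n] for n in lengths])[0]
        labels.extend([state] * k)
        state = 1 - state  # clusters alternate by definition
    return labels[:length]  # the last cluster is truncated to fit
```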

Once the exact location of errors was known, we randomly introduced
errors of five types: move, deletion, substitution, insertion
and grammatical error. “Deletion” is straightforward: a word is chosen
randomly according to a uniform distribution and deleted. “Move”
is not much more complicated: a word is chosen at random according
to the uniform distribution, and the distance it will be moved
(jump length) is chosen according to a probability which is uniform
within a given range (4 in our experiments) and null beyond.
“Grammatical” errors are generated by modifying the ending of randomly
selected words (“preserving” may become “preserved”, “environment”
may become “environmental”). “Substitution” and “insertion” are a
little more subtle. Given the position i of the word to be replaced or
inserted, the probability of every word in the vocabulary was computed
using an IBM-1 translation model and a 5-gram language model:

  $\forall w \in V: \; p(w) = P_{\text{IBM-1}}(w \mid s) \times P_{\text{5-gram}}(w \mid t_{i-4}, \dots, t_{i-1})$

The new word $t'$ was then picked among all $w$ at random according
to the distribution $p$. This way the generated errors were not too
“silly”. WordNet (Miller, 1995) was used to check that $t'$ was not a
synonym of $t$ (otherwise it would not be an incorrect word): $t'$
could not belong to any synset of which $t$ is an element. The
algorithm was controlled by several parameters, which were empirically
chosen:

— probability distribution $P_m$ of the proportion of move errors in a
sentence and probability distribution $P_j$ of jump length

— probability distribution $P_d$ of the proportion of deletions

— probability distribution $P_s$ of the proportion of substitutions

— probability distribution $P_i$ of the proportion of insertions

— probability distribution $P_g$ of the proportion of grammatical errors
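
The choice of a replacement word for substitutions and insertions can
be sketched as follows. This is only an illustration: `tm_prob` and
`lm_prob` are hypothetical callables standing for the IBM-1 and
language model probabilities of a candidate at the current position,
and `synonyms` is assumed to have been looked up beforehand (for
example in WordNet). Excluding the original word itself is also an
assumption of this sketch.

```python
import random


def sample_replacement(original, vocabulary, tm_prob, lm_prob, synonyms, rng=None):
    """Pick a word w with probability proportional to tm_prob(w) * lm_prob(w).

    Candidates that are the original word or one of its synonyms are
    excluded, so that the result is guaranteed to be an error.
    Returns None when no candidate has a non-zero probability.
    """
    rng = rng or random.Random()
    candidates = [w for w in vocabulary if w != original and w not in synonyms]
    weights = [tm_prob(w) * lm_prob(w) for w in candidates]
    if not candidates or sum(weights) == 0.0:
        return None
    return rng.choices(candidates, weights=weights)[0]
```

Filtering synonyms before sampling yields the same distribution as
drawing a word and rejecting it when it turns out to be a synonym, but
avoids repeated draws.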

We chose triangle shaped distributions with mode = 0.2, minimum =
0 and maximum = 0.5. These may not be the real distributions but
seemed reasonable. The positions of words to be moved, deleted,
inserted or modified were chosen according to a uniform probability
distribution. For each sentence errors were inserted in the order given
previously: first words were moved, then some were deleted, and so on.
Eventually we obtained a corpus with an average 16% word error rate,
which approximately matches the error rate of real machine translation
output.
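
The two simplest operations, together with the triangular distribution
of error proportions, can be sketched in Python. This is a sketch under
assumptions rather than the original code: the function names are
hypothetical, jumps are assumed to go in either direction, and a jump
that would leave the sentence is clamped to its boundaries.

```python
import random


def draw_error_proportion(rng):
    """Proportion of words affected by one error type.

    random.triangular takes (low, high, mode).
    """
    return rng.triangular(0.0, 0.5, 0.2)


def delete_random_word(words, rng):
    """Return a copy of the sentence with one uniformly chosen word removed."""
    out = list(words)
    del out[rng.randrange(len(out))]
    return out


def move_random_word(words, rng, max_jump=4):
    """Move one uniformly chosen word by a uniform jump of 1 to max_jump positions."""
    out = list(words)
    src = rng.randrange(len(out))
    jump = rng.choice([d for d in range(-max_jump, max_jump + 1) if d != 0])
    dst = min(max(src + jump, 0), len(out) - 1)  # clamp to sentence bounds
    out.insert(dst, out.pop(src))
    return out
```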

Below is an example of degraded translation obtained using this 
method, extracted from our corpus: 


source sentence:        Quant à eux, les instruments politiques doivent
                        s’adapter à ces objectifs.

reference translation:  Policy instruments, for their part, need to adapt
                        to these goals.

degraded translation:   Policy instruments, for the part, must to adapt
                        to these goals.




We used 40,000 pairs of sentences (source: French, target: English)
from the WMT-2008 evaluation campaign data. We degraded the reference
translations according to the above rules. We found that the bigram
error model gave the best results in the end (classification error
rates of confidence measures trained on such data are lower) so we used
it for all experiments presented here. The BLEU score of the degraded
corpus was 56.5, which is much higher than the score of our baseline
described in Section 4.1 (21.8). The latter score was underestimated
because only one reference translation was available. However this
phenomenon did not affect the BLEU score of the degraded corpus as it
came directly from the reference sentences and therefore there was no
need for multiple references. The error rate in the degraded corpus was
set to 16% to match that of real machine translation output.

Others have proposed the use of artificial corpora, for example (Blatz
et al., 2004) and (Quirk, 2004). While we found that automatically
generated corpora yield performances comparable to that of
expert-annotated ones (Section 6.2), the latter draws conclusions
opposed to ours: it was found that a classifier trained on a small,
human-annotated corpus performs better than one trained on a large,
automatically annotated corpus. However in that experiment sentences
are not automatically generated but automatically annotated. It is
important to understand that automatically generated data is not the
same as automatic annotation. In the latter, sentences are realistic
but there is uncertainty concerning annotation. On the other hand,
while automatically degraded sentences may seem less realistic, there
is almost no doubt that words labelled as incorrect are actually wrong,
and vice versa. Therefore automation plays a completely different role
in that system and ours. Another difference is that it evaluates
sentences, while an important task for us is to evaluate words. In
Section 6.2 we present an experiment showing that a classifier trained
on our large artificial corpus yields better results than one trained
on a small human annotated corpus (Figure 4), for a fraction of the
cost.


5. Experimental framework 

A single feature (for example, n-gram probability) can be used as a
confidence score. It is then relatively simple to evaluate its
performance because no neural network or similar machine learning
machinery is necessary. Each word or sentence is attributed a score and
a DET curve can be immediately computed. Computing NMI is a slightly
more subtle operation because a probability is needed here, and not all
predictive parameters qualify as such. In this case the score is turned
into a probability by logistic regression (Section 2.1) whose
parameters are learnt from artificial data.
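
To make the evaluation of a single score concrete, the sketch below
estimates an equal error rate from a list of scores and gold labels. It
is an illustration only and rests on assumptions not restated in this
section: an item is accepted when its score reaches the threshold, the
false acceptance rate is the fraction of incorrect items accepted, the
false rejection rate is the fraction of correct items rejected, and the
EER is read at the threshold where the two are closest.

```python
def equal_error_rate(scores, labels):
    """Approximate EER for confidence scores (labels: 1 = correct, 0 = incorrect).

    Both classes must be present in `labels`.
    """
    n_correct = sum(labels)
    n_wrong = len(labels) - n_correct
    best = None
    for threshold in sorted(set(scores)):
        accepted = [s >= threshold for s in scores]
        far = sum(a and y == 0 for a, y in zip(accepted, labels)) / n_wrong
        frr = sum((not a) and y == 1 for a, y in zip(accepted, labels)) / n_correct
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0)
    return best[1]
```

Sweeping the same thresholds and recording every (false acceptance,
false rejection) pair instead of only the closest one gives the points
of a DET curve.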

Combining several predictive parameters is a little more complicated.
Unless otherwise specified we proceeded as follows: two artificial
corpora $T_1$ (for “training”) and $D$ (“development”) were used to
find the best meta-parameters with regard to EER for SVM ($\gamma$ and
$C$, see Section 2.2) and Neural Networks (number of hidden units,
Section 2.3). Once optimal meta-parameters were found (or if none was
set) the classifier was trained on a larger set of automatically
generated data $T_2$ and finally tested on unseen real machine
translation output $U$. Then, if relevant, bias was estimated on a
corpus of automatically generated data $B$. $T_1$, $T_2$, $D$ and $B$
consisted of 10,000 sentences each (around 200,000 words). $U$
consisted of 150 sentences, or approximately 3,000 words, each of them
having one reference translation (Section 4.1).


6. Word-level confidence estimation 

We shall now look into the details of the predictive parameters we
used (the components of the vector $x(S, T, j)$) for word-level
confidence estimation. These components will be noted $x_{index}$ where
index is the label of the equation so that they are easier to find and
refer to in the paper. Altogether these features are a numerical
representation of a word in the target language ($T_j$), its context
(the whole sentence $T$), and the source sentence $S$ the translation
of which it is supposed to be a part. Of course this representation is
less expressive than the original natural words and sentences, but
hopefully it is more accessible to probability estimation while still
carrying enough information to enable us to determine whether a word is
correct or not.

Some of these features can themselves be used as confidence measures
(for example LM-based features). In this case, we provided performance
evaluation. Others cannot, for example Part-Of-Speech tag, stop word
indicator and rule based features.

6.1. Features for word level confidence estimation 

6.1.1. N-gram based features 

N-gram scores and backoff behaviour can provide a great deal of useful
information. First, the probability of a word in a classical 5-gram
language model can be used as the feature:

  $x_{21}(S,T,j) = P(t_j \mid t_{j-1}, \dots, t_{j-n+1})$   (21)

Intuitively, we would expect an erroneous word to have a lower n-gram
probability. However this feature is generally already used in
statistical translation systems, therefore even the probability levels
of wrong words may not be too low.

Backward 5-gram language models, proposed for speech recognition
confidence estimation in (Duchateau et al., 2002), also turned out to
be useful:

  $x_{22}(S,T,j) = P(t_j \mid t_{j+1}, t_{j+2})$   (22)

This feature has the advantage of generally not being used in the
decoding process.

Finally the backoff behaviours of the 5-gram and backward 5-gram
models are powerful features: an n-gram not found in the language
model may indicate a translation error. A score is given according to
how many times the LM had to back off in order to assign a probability
to the sequence, as proposed in (Uhrik and Ward, 1997) for speech
recognition:

  $x_{23}(S,T,j) = \begin{cases}
    1.0 & \text{if } t_{j-2}, t_{j-1}, t_j \text{ exists in the model} \\
    0.8 & \text{if } t_{j-2}, t_{j-1} \text{ and } t_{j-1}, t_j \text{ both exist in the model} \\
    0.6 & \text{if only } t_{j-1}, t_j \text{ exists in the model} \\
    0.4 & \text{if only } t_{j-2}, t_{j-1} \text{ and } t_j \text{ exist separately in the model} \\
    0.3 & \text{if } t_{j-1} \text{ and } t_j \text{ both exist in the model} \\
    0.2 & \text{if only } t_j \text{ exists in the model} \\
    0.1 & \text{if } t_j \text{ is completely unknown}
  \end{cases}$   (23)
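
The backoff score lends itself to a direct implementation. The sketch
below is illustrative only: the language model is reduced to a plain
set of known n-grams stored as tuples (a hypothetical representation),
and positions before the start of the sentence are treated as unknown
words, which is an assumption of this sketch.

```python
def backoff_score(tokens, j, ngrams):
    """Score word j by how far the language model has to back off.

    ngrams: set of tuples holding the unigrams, bigrams and trigrams
    known to the model, e.g. {("the",), ("the", "cat")}.
    """
    def known(*words):
        return tuple(words) in ngrams

    t = tokens[j]
    t1 = tokens[j - 1] if j >= 1 else None  # None never matches an n-gram
    t2 = tokens[j - 2] if j >= 2 else None
    if known(t2, t1, t):
        return 1.0
    if known(t2, t1) and known(t1, t):
        return 0.8
    if known(t1, t):
        return 0.6
    if known(t2, t1) and known(t):
        return 0.4
    if known(t1) and known(t):
        return 0.3
    if known(t):
        return 0.2
    return 0.1
```

The backward variant can be obtained by applying the same function to
the reversed sentence with a model learnt on reversed text.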


Figure 1 shows DET curves of the confidence measures based on 5-gram
scores, backward 5-gram scores and backoff behaviour. While 5-grams
and backward 5-grams are almost indistinguishable, backoff behaviour
performs better in terms of EER. Although this measure is very simple,
it is less correlated with those used in the decoding or degrading
process, which may explain why it achieves better discrimination
results. The results are summarised in Table I.


Table I. Performances of 3-grams based confidence measures at word level.

feature            equal error rate (%)   normalised mutual information
3-grams            42.1                   4.86 x 10^-3
backward 3-grams   42.9                   -3.93 x 10^-3
backoff            37.0                   6.11 x 10^-?
backward backoff   38.1                   1.09 x 10^-2


Figure 1. DET curves of 3-grams based confidence measures at word level.


NMI of backward 5-gram scores is negative. This is theoretically 
not possible but may be explained by a strong bias in the estimation 
of probabilities which our unbiasing method was unable to efficiently 
remove (Section 1.3.2) and because NMI was only approximated here 
(Section 3.2). 


6.1.2. Part-Of-Speech based features 

Replacing words with their POS class can help to detect grammatical
errors, and also to take into account the fact that feature values do
not have the same distribution for different word classes. Therefore we
used syntactic POS tags as a feature, along with the score of a word
in a POS 5-gram model. Tagging was performed using GPoSTTL, an
open source alternative to TreeTagger (Schmid, 1994; Schmid, 1995).

  $x_{24}(S,T,j) = POS(t_j)$   (24)

  $x_{25}(S,T,j) = P(POS(t_j) \mid POS(t_{j-2}), POS(t_{j-1}))$   (25)

With our settings, POS is a non-numeric feature which can take 44
values, say $\{\pi_1, \dots, \pi_{44}\}$. In order to combine it with
numeric features it was mapped to a vector $\pi(t_j) \in \{0,1\}^N$
with N = 44, as suggested in (Hsu et al., 2003). The mapping is
defined by

  $\pi(t_j)[i] = \begin{cases} 1 & \text{if } POS(t_j) = \pi_i \\ 0 & \text{otherwise} \end{cases}$
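
In code, this mapping is the usual one-hot encoding. The sketch below
is a generic illustration with a hypothetical three-tag tagset; a tag
that is absent from the tagset is mapped to the all-zero vector, which
is a choice of this sketch.

```python
def one_hot_pos(tag, tagset):
    """Map a POS tag to a 0/1 vector with one component per tag in `tagset`."""
    return [1 if tag == t else 0 for t in tagset]
```

For example, with `tagset = ["DT", "NN", "VB"]`, the tag `"NN"` is
mapped to `[0, 1, 0]`.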


We have chosen not to show the individual results of these confidence 
measures as they are only useful in combination with others. 

6.1.3. Taking into account errors in the context

A common property of all n-gram based features is that a word can
get a low score if it is actually correct but its neighbours are wrong.
To compensate for this phenomenon we took the scores of the neighbours
of the word being considered into account. More precisely, for every
relevant feature $x_\cdot$ defined above ($x_{21}$, $x_{22}$, $x_{23}$,
$x_{25}$) we also computed:

  $x_\cdot^{left}(S,T,j) = x_\cdot(S,T,j-2) * x_\cdot(S,T,j-1) * x_\cdot(S,T,j)$

  $x_\cdot^{centred}(S,T,j) = x_\cdot(S,T,j-1) * x_\cdot(S,T,j) * x_\cdot(S,T,j+1)$

  $x_\cdot^{right}(S,T,j) = x_\cdot(S,T,j) * x_\cdot(S,T,j+1) * x_\cdot(S,T,j+2)$
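
The three window features can be computed from the per-word scores of
any base feature. The sketch below is illustrative; in particular the
treatment of sentence boundaries (missing neighbours count as a neutral
factor of 1.0) is an assumption, since it is not specified above.

```python
def window_features(scores, j, pad=1.0):
    """Products of a per-word score over left, centred and right windows.

    scores: list of values of one base feature, one per word.
    pad: value used for positions outside the sentence (assumption).
    """
    def s(k):
        return scores[k] if 0 <= k < len(scores) else pad

    return {
        "left": s(j - 2) * s(j - 1) * s(j),
        "centred": s(j - 1) * s(j) * s(j + 1),
        "right": s(j) * s(j + 1) * s(j + 2),
    }
```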

These predictive parameters were then combined using a neural network.
Figure 2 and Table II show a vast improvement when using the
product of 5-gram probabilities of words in the centred window.



Figure 2. DET curves of 3-grams score combined with neighbours score at word 
level. 


Table II. Influence of taking the context into account.

                         equal error rate (%)   normalised mutual information
3-grams                  42.1                   4.86 x 10^-3
3-grams and neighbours   36.3 (-5.7)            4.57 x 10^-3


NMI, however, was slightly harmed in the process. This may be because
the product of 5-gram scores on the window was not a proper
estimation of probability of correctness. It is nevertheless perfectly
possible to have a confidence measure with good discrimination power
and low NMI.

6.1.4. Intra-lingual mutual information

In (Raybaud et al., 2009a; Raybaud et al., 2009b) we introduced
original predictive features based on mutual information. Mutual
information is a metric for measuring how much information a random
variable gives about another. Here we consider two random variables
whose realisations are words, say $W_1$ and $W_2$:

  $I(W_1, W_2) = \sum_{w_1, w_2} P(W_1 = w_1, W_2 = w_2) \times \log\left(\frac{P(W_1 = w_1, W_2 = w_2)}{P(W_1 = w_1) P(W_2 = w_2)}\right)$

We used point-wise mutual information which is the contribution of a
specific pair of words to the mutual information between $W_1$ and
$W_2$ (that is, a single term of the sum above).

  $MI(w_1, w_2) = P(W_1 = w_1, W_2 = w_2) \log\left(\frac{P(W_1 = w_1, W_2 = w_2)}{P(W_1 = w_1) P(W_2 = w_2)}\right)$

The tuple $(w_1, w_2, MI(w_1, w_2))$ is called a trigger. Triggers are
learnt on an unaltered bilingual corpus. The idea of using mutual
information for confidence estimation was first expressed in (Guo et
al., 2004). It has since been proved useful for computing translation
tables (Lavecchia et al., 2007).
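
As an illustration, intra-lingual triggers can be learnt from a
tokenised corpus as sketched below. The probability estimates are an
assumption of this sketch (not necessarily those used in the
experiments): $P(W = w)$ is taken to be the fraction of sentences
containing $w$, and the joint probability the fraction of sentences
containing both words.

```python
import math
from collections import Counter
from itertools import combinations


def learn_triggers(sentences):
    """Return {(w1, w2): MI(w1, w2)} for word pairs co-occurring in a sentence.

    sentences: list of token lists. Pairs are stored once, with w1 < w2.
    """
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for sent in sentences:
        types = sorted(set(sent))
        word_count.update(types)
        pair_count.update(combinations(types, 2))
    triggers = {}
    for (w1, w2), c in pair_count.items():
        p12 = c / n
        p1 = word_count[w1] / n
        p2 = word_count[w2] / n
        triggers[(w1, w2)] = p12 * math.log(p12 / (p1 * p2))
    return triggers
```

Cross-lingual triggers would be obtained in the same way by pairing
every source word with every target word of an aligned sentence pair.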

Intra-lingual mutual information (IMI) measures the level of
similarity between the words in a generated sentence thus assessing its
consistency. Formally $W_1$ and $W_2$ are any $T_i$ and $T_j$ here
(words of the translation hypothesis). Let $J$ be the length of the
translation hypothesis. The feature for confidence estimation is:

  $x_{26}(S,T,j) = \frac{1}{J-1} \sum_{1 \le i \ne j \le J} MI(t_i, t_j)$   (26)


6.1.5. Cross-lingual mutual information 

Cross-lingual mutual information (CMI) is similar to the previous
intra-lingual mutual information, except that it assesses the
consistency between the source and its translation. Let $I$ be the
length of the source sentence:

  $x_{27}(S,T,j) = \frac{1}{I} \sum_{1 \le i \le I} MI(s_i, t_j)$   (27)

Here $W_1$ and $W_2$ are any $S_i$ and $T_j$.

Table III summarises the performances of MI-based features when
used as confidence measures by themselves. Although they perform
poorly, we will see that they are useful when combined with other
predictive parameters (Section 6.2).

Table III. Performances of mutual information based features at word level.

feature         equal error rate (%)   normalised mutual information
intra-lingual   45.8                   9.46 x 10^-4
cross-lingual   45.7                   -2.21 x 10^-?


6.1.6. IBM-1 translation model

This feature was proposed in (Blatz et al., 2004; Ueffing and Ney, 
2005): 


  $x_{28}(S,T,j) = \frac{1}{I+1} \sum_{i=0}^{I} P_{\text{IBM-1}}(t_j \mid s_i)$   (28)

where $s_0$ is the empty word. The performance of this predictive
parameter used alone is given in Table IV. Once again the results are
disappointing. The results are extremely similar to alignment
probability (the sum is replaced by a max). It is surprising to note
that even on a translation evaluation task, measures involving only the
hypothesis yield better performances than those taking the source
sentence into account.

Table IV. Performance of IBM-1 based confidence measure at word level.

feature       equal error rate (%)   normalised mutual information
IBM-1 score   45.0                   -1.84 x 10^-?


Like MI-based features, IBM-1 does not work very well when used as
a confidence measure and will only be used in combination with others.


6.1.7. Stop words and rule-based features 

The “stop word” predictive parameter is a simple flag indicating if the
word is a stop word (the, it, etc.) or not. It helps a classifier to
take into account the fact that the distribution of other features is
not the same on stop words and on content words. This feature is less
informative than Part-Of-Speech, but simpler.

  $x_{29}(S,T,j) = \begin{cases} 1 & \text{if } t_j \text{ is a stop word} \\ 0 & \text{otherwise} \end{cases}$   (29)




The stop list was generated by picking words that are both short
and frequent. Finally we implemented four binary features indicating
whether the word is a punctuation symbol, numerical, a URL or a
proper name (based on a list of those). These features were of
course not designed to be used as standalone confidence measures.
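
These surface features are simple to compute. The sketch below is only
an illustration: the length and frequency thresholds of the stop list,
the regular expressions and the function names are all hypothetical
choices, not the settings used in the experiments.

```python
import re
from collections import Counter


def build_stop_list(tokens, max_len=3, min_count=100):
    """Pick words that are both short and frequent (thresholds are assumptions)."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if len(w) <= max_len and c >= min_count}


def surface_features(word, stop_list, proper_names):
    """Binary flags: stop word, punctuation, numerical, URL, proper name."""
    return {
        "stop_word": int(word.lower() in stop_list),
        "punctuation": int(bool(word) and all(not ch.isalnum() for ch in word)),
        "numerical": int(re.fullmatch(r"[+-]?\d+([.,]\d+)*", word) is not None),
        "url": int(re.match(r"(https?://|www\.)", word) is not None),
        "proper_name": int(word in proper_names),
    }
```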

6.2. Features combination 

Altogether we had 66 features for word-level confidence estimation,
many of them very similar (for example 5-gram probability and average
5-gram probabilities on different windows), some very crude
(for example sentence-level features like length ratio, Section 7.1.5,
used at word level). We trained four classifiers (Logistic Regression,
Partial Least Squares Regression, Support Vector Machines and
Neural Network) to discriminate between correct and incorrect words
based on these features. Only Neural Networks brought a consistent
improvement over the best feature used alone (5-gram scores on a
centred window, Sections 6.1.1 and 6.1.3) for the classification task,
although this was not a large improvement (-1.3 points of EER). The
DET curve of neural networks is presented in Figure 3 and the results
are summarised in Table V.



Figure 3. Combination of all features by neural network. 


The network used was a fully connected three-layer perceptron with 
66 input nodes, 33 hidden nodes and one output node. The activation 
function is sigmoid. 
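
For readers who want to picture the architecture, the forward pass of
such a network can be sketched in plain Python. This is an untrained
toy with random weights, meant only to show the 66-33-1 layout with
sigmoid activations; the presence of bias terms, the initialisation
range and the function names are assumptions of this sketch, and
training is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """One weight row per output unit; the last weight of a row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layer, inputs):
    """Sigmoid of the weighted sum of the inputs plus the bias, for each unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1]) for row in layer]


def confidence(features, hidden, output):
    """Output of the single output node for one 66-dimensional feature vector."""
    return forward(output, forward(hidden, features))[0]


rng = random.Random(0)
hidden = init_layer(66, 33, rng)   # 66 inputs -> 33 hidden units
output = init_layer(33, 1, rng)    # 33 hidden units -> 1 output
```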

The NMI results were especially disappointing. As explained in
Section 3.2, NMI is harmed by bias. Although we estimated bias on a
dedicated set of training data and removed it from the final
estimation, we believe that the poor performance may be explained by
the fact that bias is very different for artificial and natural data
and probably much larger on the latter.

Table V. Performances of all word level features combined.

classifier            equal error rate (%)   NMI             training time   testing time
Logistic Regression   36.8                   -2.61 x 10^-2   13 s            5 s
PLSR                  37.5                   -5.84 x 10^-2   15 min          1 s
SVM                   36.7                   -1.87 x 10^-1   12 h            500 s
Neural Network        35.0                   6.06 x 10^-2    10 min          2 s

In order to evaluate the performance gain brought by the automatically
generated training corpus, we also split the annotated sentences into
a training set (70 sentences), a development set (30 sentences) and a
testing set (50 sentences), on which we trained and evaluated the
neural network. Figure 4 and Table VI show that training on annotated
data does not yield better results than training on the generated
corpus. The natural corpus is small, but it must be noted that the
artificial corpus was generated in a few hours, while it took more than
one day to annotate all the sentences. In addition, human annotations
are subject to time and inter-annotator variations. Employing a trained
professional may alleviate these problems but is of course more
expensive.



Figure 4. Training neural network on annotated or generated corpus.

In Table VII we show the modest contribution of mutual information
(Sections 6.1.4 and 6.1.5) to the performance of neural network
combination of the features.

Table VI. Performances of all word level features combined.

classifier                       equal error rate (%)   NMI
NN trained on generated corpus   35.0                   6.06 x 10^-2
NN trained on annotated corpus   36.8                   5.79 x 10^-2


Table VII. Contribution of mutual information based confidence measure to overall performance

                      equal error rate (%)   NMI
without IMI and CMI   35.6                   5.32 x 10^-2
with IMI and CMI      35.0                   6.06 x 10^-2
improvement           -0.60                  +7.4 x 10^-3


7. Sentence-level confidence estimation 


The features described in this Section form a numerical representation
of a pair made up of a source sentence and a target sentence. As in the
previous section, our aim was to compute the distribution of
probability of correctness on the numerical space (a subspace of
$\mathbb{R}^{d_{sentence}}$). Unlike at word level, the algorithm for
generating degraded sentences cannot reliably tell if a degraded
sentence is still correct or not. We got around the problem of creating
a corpus for training classifiers (Section 7.2) but we could not
automatically generate a corpus for estimating probability biases.
Therefore all normalised mutual information values are poor.

Many word-level features can be extended to sentence level by
arithmetic or geometric averaging, for example IBM-1 translation
probability, n-gram probability, etc.


7.1. Features for sentence level confidence estimation 


7.1.1. LM based features 

The first features we propose are sentence normalised likelihood in a
5-gram model (forward and backward) and average backoff behaviour:

  $x_{30}(s,t) = \left( \prod_{j=1}^{J} P(t_j \mid t_{j-1}, \dots, t_{j-n+1}) \right)^{1/J}$   (30)

  $x_{31}(s,t) = \left( \prod_{j=1}^{J} P(t_j \mid t_{j+1}, \dots, t_{j+n-1}) \right)^{1/J}$   (31)

  $x_{32}(s,t) = \frac{1}{J} \sum_{j=1}^{J} x_{23}(S,T,j)$   (32)

They can also be used as confidence measures by themselves and their 
performances are presented in Table VIII and Figure 5 together with 
intra-lingual mutual information, another kind of language model. 


Table VIII. Performances of 3-gram and backoff based confidence measures at sentence-level.

feature                                 equal error rate (%)   normalised mutual information
3-gram normalised likelihood            41.7                   4.02 x 10^-3
backward 3-gram normalised likelihood   41.3                   3.97 x 10^-3
averaged backoff behaviour              34.2                   4.15 x 10^-3


The following predictive parameter is the source sentence likelihood.
Its aim is to reflect how difficult to translate the source sentence
is. It is obviously not designed to be used alone.

  $x_{33}(s,t) = \left( \prod_{i=1}^{I} P(s_i \mid s_{i-1}, \dots, s_{i-n+1}) \right)^{1/I}$   (33)


7.1.2. Average mutual information

  $x_{34}(s,t) = \frac{1}{J(J-1)} \sum_{i=1}^{J} \sum_{1 \le j \ne i \le J} MI(t_i, t_j)$   (34)

  $x_{35}(s,t) = \frac{1}{I \cdot J} \sum_{i=1}^{I} \sum_{j=1}^{J} MI(s_i, t_j) = \frac{1}{J} \sum_{j=1}^{J} x_{27}(S,T,j)$   (35)

We were surprised to observe that cross-lingual MI performed even
worse at sentence level than at word level. We have only presented the
results for intra-lingual MI in Figure 5 and Table IX, as its
performance was closer to other standard confidence measures than it
was at word level.

Table IX. Intra-lingual mutual information as a sentence-level confidence measure.

feature   equal error rate (%)   normalised mutual information
IMI       39.0                   9.46 x 10^-4



Figure 5. DET curves of 3-gram, backoff and intra-lingual mutual information based 
confidence measures at sentence level 


7.1.3. Normalised IBM-1 translation probability 

The score of a sentence is its translation probability in IBM model 1, 
normalised to avoid penalising longer sentences: 


®36(s,t) 


n E n 

*=1 j =o 



1 

I 


(36) 


As was the case at word level it is surprising to note that although
the system was tested on a translation task, confidence measures
involving the source sentence do not perform better than the ones
involving only the target sentence.


7.1.4. Basic syntax check 

A very basic parser checks that brackets and quotation marks are
matched, and that full stops, question or exclamation marks, colons or
semi-colons are located at the end of the sentence (Blatz et al., 2004).

  $x_{37}(s,t) = \begin{cases} 1 & \text{if } t \text{ is parsable} \\ 0 & \text{otherwise} \end{cases}$   (37)

This feature and the following are only pieces of information about
the source and target sentences; they are not confidence measures
themselves.
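
A check of this kind can be sketched in a few lines of Python. The code
below is one possible reading of the description above and not the
parser of (Blatz et al., 2004): it only handles ASCII brackets and
straight double quotes, and it will reject legitimate sentences that
contain abbreviations or decimal numbers.

```python
def basic_syntax_ok(sentence):
    """Return 1 if brackets/quotes match and strong punctuation is final, else 0."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in sentence:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return 0
    if stack or sentence.count('"') % 2 != 0:
        return 0
    # Ignore trailing spaces and closing quotes/brackets, then require that
    # strong punctuation marks occur nowhere except in final position.
    body = sentence.rstrip().rstrip("\"')]}")
    if any(ch in ".?!:;" for ch in body[:-1]):
        return 0
    return 1
```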

7.1.5. Length based features 

These very basic features reflect levels of consistency between the
lengths of a source sentence and of its translation (Blatz et al.,
2004). The idea is that source and target sentences should be
approximately of the same length, at least for language pairs such as
French/English:

  $x_{38}(s,t) = Len(s)$   (38)

  $x_{39}(s,t) = Len(t)$   (39)

  $x_{40}(s,t) = \frac{Len(t)}{Len(s)}$   (40)


7.2. Combination of sentence-level features 

As explained earlier in the paper, a generation algorithm cannot tell
which sentences are to be considered correct and which are not.
Therefore, for sentence-level confidence, it was not directly possible
to train classifiers to discriminate between correct and incorrect
sentences. Instead, we used SVM, Neural Networks and Partial Least
Squares (PLS) to perform regression against sentence level BLEU score
(see footnote 3). Sentences were then classified by thresholding this
score.

Table X. Performances of PLSR, SVM and Neural Nets at sentence level.

feature      equal error rate (%)   NMI
PLS          29.0                   8.14 x 10^-3
SVM          38.0                   -2.56 x 10^-1
Neural Net   41.3                   -2.44 x 10^-3


Only PLS was found to improve (by 5.2 points, absolute) on the
best stand-alone confidence measure (average backoff behaviour, Section
7.1.1). Its correlation coefficient with human evaluation was 0.358.


3 It is true that BLEU is not well suited to sentence-level estimation.
It has the advantage of being a well known automatic metric for which
efficient toolkits are available. We also experimented with TER but too
many sentences produced a null score.

Figure 6. DET curves of PLS and Neural Network combination of sentence level
features


8. Preliminary Post Edition Experiment 

The previous sections have given a detailed explanation of how the
proposed confidence measures work and the amount of errors they
are able to detect. In this section we will describe a more subjective
usability experiment. Our aim was to obtain qualitative feedback from
real users of the system about the usability of confidence measures for
assisted post edition. Because of the limited number of subjects, and
the fact that many predictive parameters are still work in progress,
these results are only to be interpreted as hints at what users want
and find useful, at what we did right or wrong and at which direction
we should follow in our research. The experimental protocol is inspired
by the one described in (Plitt and Masselot, 2010). We implemented a
post edition tool with confidence measures and let users correct
machine translated sentences, with and without the help of confidence
measures.

8.1. The post edition tool 

The program we developed (see screenshot in Figure 7) can be seen 
as a simplified version of a tool for Computer Assisted Translation. It 
displays a source sentence (in our case, in French) and a translation 
generated by Moses (in English). Errors detected by the confidence 
measures are highlighted. The user can then opt to edit the proposed 
translation. 

The source sentence is displayed in the top field with the candidate
translation in the field below. On the left there is a slider with
which the user can change the acceptance threshold of the confidence
estimation system (Section 1.3.1). All words with a score below this
threshold are displayed in red. Simplified explanations are given to
the user, who does not require a full “lecture” on confidence
estimation: he or she is told that s/he may use an automatic help to
detect erroneous words, and that the requested quality can be changed
with this slider, if s/he so wishes. Of course, if his/her quality
requirements are too high (corresponding to a threshold value of 1,
that is, the point to the far right on the DET curve, see Section 3.1),
the system will incorrectly consider all words to be wrong. The user
can edit the candidate translation if s/he thinks it is necessary. When
s/he is satisfied with the translation s/he has to click on “next”. For
the sake of the experiment the user may not come back to a sentence
that has already been validated. If required, the user can click on
“pause” to take a break, so that the program stops counting the time
spent on the translation; otherwise the time statistics would be
meaningless. However none of the users ever took a break. Everything
else on this GUI is cosmetic (progress bar, etc.).

Figure 7. Screenshot of the post edition software.

The total time spent on each sentence was recorded (the time between
the loading of the sentence and clicking on the “next” button).
This is actually the sum of three partial times, which are also
recorded: time typing on the keyboard, time spent on the interface
(moving the acceptance slider) and thinking time (the rest).

It should be noted that the proposed translation and confidence
scores were not computed on the fly, in order to keep the program
responsive and easily portable. This is quite a heavy constraint because
the system cannot take the user’s edits into account to compute a
new, improved translation, and cannot compute the confidence of the
post edited translation (our users were of course informed of that).
Also, while all users stated that the program was easy to use, an
ergonomist’s input would be required to ensure that we made the right
choices with regard to usability and that what we measure is really the
influence of confidence measures and not that of the interface.

8.2. Experimental protocol 

Because we were not expecting many volunteers, we wanted their English
skills to be as homogeneous as possible (all of them are French
native speakers) in order to limit the variability of the results. Seven
subjects volunteered for the experiment. Six of them are English
teachers and one is a master’s student in English. Unfortunately two of
them failed to correctly follow the instructions and the corresponding
data was discarded. The experiment lasted approximately two hours,
divided into four stages:

First stage: introduction and training. The users were provided with 
some basic explanations about the domain and the task and given ten 
sentences to post edit along with simple instructions (see below). These 
sentences were just for training purposes and were not included in the 
final results. 

Second stage: first experiment. The users were told to start the first
experiment when ready. They were given 30 sentences with their
corresponding machine translations and were told they could post edit
these translations with the help of the confidence measures.

Third stage: second experiment. This experiment was identical to the
first, except that the users did not have access to confidence measures.
One volunteer out of two had the second experiment before the first, in
order to compensate for the “training effect” (users complete the second
experiment faster than the first one) and for fatigue (a user may be
tired by the time he or she starts the second experiment, thus affecting
post edition speed and quality).

Fourth stage: user feedback. Finally, the users were asked to complete 
a questionnaire, providing us with feedback on the post edition software 
and the confidence measures. 

We gave the following instructions to the users, with the idea that 
translated documents must be good enough to be read without extra 
effort, but not necessarily in beautiful, idiomatic English: 

— The goal is to obtain a correct translation, not necessarily a very
fluent one. Fix mistakes, not style.

— You can use any help you want (most users actually used paper
or online dictionaries) but:

• Don’t use an online tool to re-translate the sentence 

• Don’t spend too much time on details 

• Don’t ask the supervisor for help 

The sentences were random subsets of the test set of the WMT09
campaign, which consists of transcripts of news broadcasts. Each user
had to post edit two randomized sets of thirty sentences. This choice is
questionable insofar as most “real life” applications consist of
translating whole documents rather than a sequence of unconnected
sentences. However we chose randomized subsets so that the intrinsic
difficulty of the task did not influence the results.
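
The assignment of sentences and conditions described above can be
sketched as follows. This is only an illustration under our own
assumptions: the function name build_sessions, the seed and the size of
the test set are hypothetical, and we assume that the two sets given to
one user do not overlap, which the protocol does not state explicitly.

```python
import random


def build_sessions(test_set, num_users, set_size=30, seed=0):
    """Draw two random sentence sets per user and alternate the condition order."""
    rng = random.Random(seed)
    sessions = []
    for user in range(num_users):
        # Sampling 2 * set_size sentences at once keeps a user's two sets
        # disjoint (an assumption: the protocol does not say whether the
        # two sets could share sentences).
        drawn = rng.sample(test_set, 2 * set_size)
        order = [("with CM", drawn[:set_size]),
                 ("without CM", drawn[set_size:])]
        if user % 2 == 1:
            # Every other user starts without confidence measures, to balance
            # the training effect and fatigue across the two conditions.
            order.reverse()
        sessions.append(order)
    return sessions


# Hypothetical test set of 500 sentence identifiers and 7 volunteers.
sessions = build_sessions(list(range(500)), num_users=7)
for user, order in enumerate(sessions):
    print(user, [condition for condition, _ in order])
```

Alternating the order by user index is the simplest way to balance the
two conditions; with an odd number of volunteers one order is
necessarily used once more than the other.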

8.3. Results and Analysis 

Table XI summarises the most important results of the experiments. 
Most of these metrics are straightforward but some are worthy of more 
explanation. 

Sentence quality: After the experiment, all the post edited translations
were scored by a team member, a native French speaker also
fluent in English. Each sentence received a score between 1 and 5 in
the same fashion as in StatMT evaluation tasks:

1. the translation is completely unusable. 

2. the translation is seriously faulty but a degree of meaning can be 
grasped. 

3. the translation is usable although not very good. 

4. the translation has minor flaws. 

5. the translation is very good. 


Correlation between confidence estimations and editions: our aim here
was to check how well the users’ decisions and the machine predictions
correlated. To this end every word in the machine generated hypothesis
was mapped to 1 if it was Levenshtein aligned to a word in the edited
hypothesis (which means it was not modified), 0 otherwise (which means
it had been inserted or modified by the user). The corpus was therefore
mapped to a sequence of 0s and 1s and we computed the correlation
between this sequence and the estimated probabilities of correctness.

Ratio of number of editions over number of detected errors: this is the
ratio of the number of edits made to the original hypothesis over the
number of errors which were detected by the system. A high ratio
suggests that the user could not find an appropriate trade-off between
false positives and false negatives and had to lower his quality
requirement (using the slider) in order to get an acceptable level of
accuracy.
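
The last two metrics can be illustrated with a minimal sketch. It rests
on several assumptions of ours that the text does not confirm: the
correlation is taken to be Pearson's coefficient, the number of edits
is taken to be the word-level Levenshtein distance, and a word counts
as a detected error when its estimated probability of correctness falls
below the slider threshold. All function names and the sample data are
hypothetical.

```python
from math import sqrt


def kept_word_labels(hypothesis, edited):
    """Return (labels, distance): labels[i] is 1 if hypothesis word i is
    matched to an identical word of the edited sentence by a Levenshtein
    alignment, else 0; distance is the word-level edit distance."""
    n, m = len(hypothesis), len(edited)
    # dist[i][j] = edit distance between hypothesis[:i] and edited[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hypothesis[i - 1] == edited[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    # Walk back through the table to recover one optimal alignment.
    labels = [0] * n
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if hypothesis[i - 1] == edited[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            labels[i - 1] = 1 - cost  # 1 only for an exact match
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return labels, dist[n][m]


def pearson(xs, ys):
    """Pearson correlation; None when one of the sequences is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def edits_over_detected(num_edits, confidences, threshold):
    """Ratio of edits to words flagged as errors (confidence < threshold)."""
    detected = sum(1 for c in confidences if c < threshold)
    return num_edits / detected if detected else None


# Hypothetical data, for illustration only.
hypothesis = "the cat sit on mat".split()
edited = "the cat sits on the mat".split()
confidences = [0.9, 0.8, 0.3, 0.7, 0.6]  # estimated P(correct) per word

labels, num_edits = kept_word_labels(hypothesis, edited)
print(labels)     # [1, 1, 0, 1, 1]
print(num_edits)  # 2
print(pearson(labels, confidences))
print(edits_over_detected(num_edits, confidences, threshold=0.5))  # 2.0
```

For a whole corpus, the labels and confidence scores of all sentences
would be concatenated before computing the correlation, and the edits
and detected errors summed before taking the ratio.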


Table XI. Effect of confidence estimation on a post edition task.

                                        without CM       with CM
Average time per sentence (seconds):    77               87
Average edit rate:                      30%              32%
Average sentence quality:               4.3              4.2

                                        1st experiment   2nd experiment
Average time per sentence (seconds):    84.22            80.12
Average edit rate:                      0.29             0.33
Average sentence quality:               4.2              4.3

Ratio of corrections/detected errors:   1.76
Correlation between CM and editions:    0.23


While the results in terms of translation speed are disappointing
(Table XI), this experiment was primarily designed to obtain
qualitative feedback from real users of the system. This is what the
following analysis will focus on, in order to determine what must be
improved and how. A finer grained analysis showed that the time
difference is entirely due to “thinking” time. User feedback confirmed
that they thought the help was not reliable enough to be useful and that
even if it sometimes drew their attention to some mistakes, checking the
system’s recommendations wasted their time. However it must be noted
that users were significantly faster during the second post edition task
than the first (80.12 against 84.22 seconds per sentence on average).
This suggests that more training is needed before users grow
accustomed to the task and really see the program as a tool instead of
a constraint. We believe that an experiment involving more users over
a longer time frame is necessary. The consistently high and comparable
edit rate with and without confidence measures suggests (and this
is confirmed by feedback) that a lot of editing was required, but the
high ratio of number of corrections over automatically detected errors
suggests that confidence measures were not able to precisely
discriminate between correct and incorrect words. Regardless of
confidence estimation, many of our users stated that they would rather
translate a sentence from scratch than edit a flawed machine translation.

As a conclusion to this experiment, we propose the following
directions for further improvements and experiments:

— The users should be given a coherent task, such as a whole
document, rather than random sentences.

— Users need longer training time as some of them were still not sure
what to do with the slider by the end of the tasks. Measurements
show that their efficiency continued to increase after the training
stage. We believe they need more time to get used to the tool and
make the best of it.

— The program interface needs to be carefully designed with
ergonomics in mind in order to really measure the influence of
confidence measures and not that of the GUI.

— We need more reliable confidence measures and above all, we greatly
need to focus on precision rather than recall as we observed that
false alarms were very disconcerting for users.


9. Conclusion 

After introducing and formalising the problem, we presented a method
which makes it possible to generate large amounts of training data,
then developed a list of predictive parameters which we consider to be
some of the most significant for confidence estimation, including two
original measures based on mutual information. We compared different
machine learning techniques combining the features we proposed. Among
these techniques, we consider Neural Networks and Partial Least Squares
Regression to be the best suited, depending on the application. We have
shown that combining many features improves over the best predictive
parameters alone, by 1.3 points (absolute) EER at word level and 6
points at sentence level on a classification task. Finally we presented
an experiment aimed at measuring how helpful confidence estimation
is in a post edition task. This experiment suggested that our confidence
estimation system is not mature enough to be helpful in such a setting.
However the limited number of volunteers and the lack of long term
observations make the results somewhat difficult to interpret.
Nevertheless, the knowledge we gained from this experiment and from
user feedback will help us improve confidence measures for the benefit
of future users.

Our hope is that this paper will provide the necessary information 
to enable the construction of a complete confidence estimation system 
for machine translation from scratch and facilitate the incorporation 
therein of new predictive features. In addition to assisted post edition 
we believe there are many useful applications for confidence estimation, 
namely: 

— Warning a user that the translation he requested may be flawed.

— Automatically rejecting hypotheses generated by the decoder or
combining several systems in a voting system.

— Recombining good phrases from an n-best list or a word graph to
generate a new hypothesis.

We have also identified important research directions in which this
work could be extended to make confidence measures more helpful
for users. Firstly, confidence estimates could be computed at phrase
level, which would enable users to work on semantically consistent
chunks while retaining a more fine-grained analysis than with
sentences. Secondly, semantic features could be introduced, which would
make it possible to detect otherwise tricky errors like missing
negations and help users to focus on meaning errors rather than
grammatical errors and disfluencies which are, in some cases, arguably
less important.